Report Overview

This report provides an overview-level data quality diagnosis of the reportData dataset. Its purpose is to help judge the validity of variables before conducting exploratory data analysis (EDA).

Overview

Data Structures

Table 1: Data Structures and Types
division metrics value
size observations 19,189
size variables 19
size values 364,591
size memory size (MB) 4
duplicated duplicate observation 0
missing complete observation 8,956
missing missing observation 10,233
missing missing variables 9
missing missing values 20,784
data type numerics 2
data type integers 1
data type factors/ordered 11
data type characters 2
data type Dates 1
data type POSIXcts 2
data type others 0
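
The size, duplicate, and missing metrics in Table 1 can be derived from the raw records. The following is a minimal sketch, not the method used to generate this report. It assumes records are held as a list of dictionaries with None marking a missing value, and that "duplicate observation" counts rows that repeat an earlier row. The function name and the toy records are hypothetical.

```python
def overview(rows):
    """Summarise a list of dict records; None marks a missing value."""
    n_obs = len(rows)
    columns = list(rows[0].keys()) if rows else []
    # Missing cells per column, and rows containing at least one missing cell
    missing_per_col = {c: sum(r[c] is None for r in rows) for c in columns}
    missing_obs = sum(any(r[c] is None for c in columns) for r in rows)
    # Rows that repeat an earlier row (assumed meaning of "duplicate observation")
    distinct = {tuple(r[c] for c in columns) for r in rows}
    return {
        "observations": n_obs,
        "variables": len(columns),
        "values": n_obs * len(columns),
        "duplicate observation": n_obs - len(distinct),
        "complete observation": n_obs - missing_obs,
        "missing observation": missing_obs,
        "missing variables": sum(1 for k in missing_per_col.values() if k > 0),
        "missing values": sum(missing_per_col.values()),
    }

# Toy records (hypothetical, not taken from reportData)
toy = [
    {"gender": "Male", "training_hours": 36},
    {"gender": None, "training_hours": 47},
    {"gender": "Male", "training_hours": 36},
]
summary = overview(toy)
```

The same relationships hold for the figures in Table 1: 19,189 observations × 19 variables = 364,591 values, and 8,956 complete + 10,233 missing observations = 19,189.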

Warnings

Table 2: Warnings in dataset and variables
warnings status recommendation
company_type has 6,145 (32%) missing values missing judgement
company_size has 5,945 (31%) missing values missing judgement
gender has 4,519 (23.5%) missing values missing judgement
major_discipline has 2,815 (14.7%) missing values missing judgement
education_level has 461 (2.4%) missing values missing judgement
last_new_job has 424 (2.2%) missing values missing judgement
enrolled_university has 387 (2%) missing values missing judgement
experience has 65 (0.3%) missing values missing judgement
base_date has 23 (0.1%) missing values missing judgement
ids has high (1.00) cardinality, maybe an identifier cardinality check
test has constant value “0” cardinality remove
base_date3 has constant value “2021-06-12 09:00:00” cardinality remove
test has 19,189 (100%) zeros zero check
city_dev_index has 13 (0.07%) zeros zero check
city_dev_index has 6 (0.03%) negatives negative check
training_hours has 986 (5.14%) outliers outlier judgement
city_dev_index has 36 (0.19%) outliers outlier judgement

Variables

Table 3: List of Variables Diagnosis
variables            types      missing  cardinality  zero  minus  outlier
enrollee_id          character  -        high         -     -      -
city                 factor     -        -            -     -      -
city_dev_index       numeric    -        -            X     X      X
gender               factor     X        -            -     -      -
relevent_experience  factor     -        -            -     -      -
enrolled_university  factor     X        -            -     -      -
education_level      ordered    X        -            -     -      -
major_discipline     factor     X        -            -     -      -
experience           ordered    X        -            -     -      -
company_size         ordered    X        -            -     -      -
company_type         factor     X        -            -     -      -
last_new_job         ordered    X        -            -     -      -
training_hours       integer    -        -            -     -      X
job_chnge            factor     -        -            -     -      -
test                 numeric    -        constant     X     -      -
ids                  character  -        identifier   -     -      -
base_date            Date       X        -            -     -      -
base_date2           POSIXct    -        -            -     -      -
base_date3           POSIXct    -        constant     -     -      -

Missing Values

List of Missing Values

Table 4: List of Variables Diagnosis
variables missing_count missing (%) status recommendation
company_type 6,145 32% Bad Model based Imputation
company_size 5,945 31% Bad Model based Imputation
gender 4,519 23.5% Bad Model based Imputation
major_discipline 2,815 14.7% NotBad Model based Imputation
education_level 461 2.4% Good Delete or Imputation
last_new_job 424 2.2% Good Delete or Imputation
enrolled_university 387 2% Good Delete or Imputation
experience 65 0.3% Good Delete or Imputation
base_date 23 0.1% Good Delete or Imputation
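
The missing percentages in Table 4 are each variable's missing count divided by the 19,189 observations reported in Table 1. A short sketch of that arithmetic follows, using only figures copied from the report. The helper name missing_pct is hypothetical, and the rule that maps a percentage to the Bad, NotBad, or Good status is not stated in the report, so it is not reproduced here.

```python
N_OBS = 19_189  # observations, from Table 1

# Missing counts copied from Table 4
missing_count = {
    "company_type": 6_145,
    "company_size": 5_945,
    "gender": 4_519,
    "major_discipline": 2_815,
    "education_level": 461,
    "last_new_job": 424,
    "enrolled_university": 387,
    "experience": 65,
    "base_date": 23,
}

def missing_pct(count, n=N_OBS, digits=1):
    """Share of observations with a missing value, in percent."""
    return round(100 * count / n, digits)

pct = {name: missing_pct(count) for name, count in missing_count.items()}
total_missing = sum(missing_count.values())  # compare with "missing values" in Table 1
```

The nine counts sum to 20,784, which agrees with the "missing values" row of Table 1.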

Visualization


Unique Values

Categorical Variables

Variables for which the proportion of unique values exceeds 0.5, or which have exactly one unique value.
Table 5: Detail warning categorical cardinality
variables types unique unique (%) status recommendation
enrollee_id character 19,158 99.8% high cardinality Judgment
ids character 19,189 100% identifier Use as ID
base_date3 POSIXct 1 0% constant Remove Variable

Numerical Variables

Variables with fewer than 5 unique values, or with exactly one unique value.
Table 6: Detail warning numerical cardinality
variables types unique unique (%) status recommendation
test numeric 1 0% constant Remove Variable
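
The two cardinality rules stated in this section can be sketched as simple checks on unique-value counts. This is an illustration under assumptions, not the report generator's code. The function names are hypothetical; the split between "identifier" (every value unique) and "high cardinality" is inferred from the rows of Table 5; and the label "low cardinality" is invented for the case of fewer than 5 unique values, which has no example in Table 6.

```python
def categorical_status(n_unique, n_obs):
    """Classify a categorical variable by its number of unique values."""
    if n_unique == 1:
        return "constant"
    if n_unique == n_obs:
        return "identifier"          # inferred from the ids row of Table 5
    if n_unique / n_obs > 0.5:
        return "high cardinality"
    return None                      # no cardinality warning

def numerical_status(n_unique):
    """Classify a numerical variable by its number of unique values."""
    if n_unique == 1:
        return "constant"
    if n_unique < 5:
        return "low cardinality"     # hypothetical label, not used in the report
    return None

N_OBS = 19_189  # observations, from Table 1
enrollee_id = categorical_status(19_158, N_OBS)
ids = categorical_status(19_189, N_OBS)
base_date3 = categorical_status(1, N_OBS)
test_var = numerical_status(1)
```

With the counts from Tables 5 and 6, these checks return the same status values that the tables list for enrollee_id, ids, base_date3, and test.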

Categorical Variable Diagnosis

Top Ranks

Table 7: Top 10 levels of categorical variables
variables levels freq ratio (%)
base_date 2021-06-26 700 3.6
base_date 2021-07-10 688 3.6
base_date 2021-06-19 684 3.6
base_date 2021-07-03 671 3.5
base_date 2021-06-15 662 3.4
base_date 2021-07-09 662 3.4
base_date 2021-06-13 661 3.4
base_date 2021-06-21 655 3.4
base_date 2021-07-12 654 3.4
base_date 2021-06-16 645 3.4
base_date Other levels 12,484 65.1
base_date Missing 23 0.1
base_date2 2021-06-12 09:00:20 696 3.6
base_date2 2021-06-12 09:00:06 693 3.6
base_date2 2021-06-12 09:00:12 692 3.6
base_date2 2021-06-12 09:00:23 672 3.5
base_date2 2021-06-12 09:00:19 664 3.5
base_date2 2021-06-12 09:00:25 664 3.5
base_date2 2021-06-12 09:00:14 660 3.4
base_date2 2021-06-12 09:00:24 660 3.4
base_date2 2021-06-12 09:00:28 656 3.4
base_date2 2021-06-12 09:00:03 646 3.4
base_date2 Other levels 12,486 65.1
base_date3 2021-06-12 09:00:00 19,189 100.0
city city_103 4,361 22.7
city city_21 2,710 14.1
city city_16 1,535 8.0
city city_114 1,338 7.0
city city_160 848 4.4
city city_136 586 3.1
city city_67 431 2.2
city city_102 305 1.6
city city_75 305 1.6
city city_104 301 1.6
city Other levels 6,469 33.7
company_size 50-99 3,090 16.1
company_size 100-499 2,578 13.4
company_size 10000+ 2,022 10.5
company_size 10-49 1,474 7.7
company_size 1000-4999 1,331 6.9
company_size <10 1,308 6.8
company_size 500-999 878 4.6
company_size 5000-9999 563 2.9
company_size Missing 5,945 31.0
company_type Pvt Ltd 9,838 51.3
company_type Funded Startup 1,002 5.2
company_type Public Sector 957 5.0
company_type Early Stage Startup 605 3.2
company_type NGO 521 2.7
company_type Other 121 0.6
company_type Missing 6,145 32.0
education_level Graduate 11,616 60.5
education_level Masters 4,371 22.8
education_level High School 2,017 10.5
education_level Phd 415 2.2
education_level Primary School 309 1.6
education_level Missing 461 2.4
enrolled_university no_enrollment 13,839 72.1
enrolled_university Full time course 3,763 19.6
enrolled_university Part time course 1,200 6.3
enrolled_university Missing 387 2.0
enrollee_id 16814 2 0.0
enrollee_id 18272 2 0.0
enrollee_id 19249 2 0.0
enrollee_id 20866 2 0.0
enrollee_id 20881 2 0.0
enrollee_id 21563 2 0.0
enrollee_id 21634 2 0.0
enrollee_id 22899 2 0.0
enrollee_id 23825 2 0.0
enrollee_id 24936 2 0.0
enrollee_id Other levels 19,169 99.9
experience >20 3,293 17.2
experience 5 1,434 7.5
experience 4 1,405 7.3
experience 3 1,359 7.1
experience 6 1,218 6.3
experience 2 1,132 5.9
experience 7 1,029 5.4
experience 10 986 5.1
experience 9 982 5.1
experience 8 802 4.2
experience Other levels 5,484 28.6
experience Missing 65 0.3
gender Male 13,241 69.0
gender Female 1,238 6.5
gender Other 191 1.0
gender Missing 4,519 23.5
ids ID1 1 0.0
ids ID10 1 0.0
ids ID100 1 0.0
ids ID1000 1 0.0
ids ID10000 1 0.0
ids ID10001 1 0.0
ids ID10002 1 0.0
ids ID10003 1 0.0
ids ID10004 1 0.0
ids ID10005 1 0.0
ids Other levels 19,179 99.9
job_chnge No 14,406 75.1
job_chnge Yes 4,783 24.9
last_new_job 1 8,054 42.0
last_new_job >4 3,296 17.2
last_new_job 2 2,903 15.1
last_new_job never 2,458 12.8
last_new_job 4 1,030 5.4
last_new_job 3 1,024 5.3
last_new_job Missing 424 2.2
major_discipline STEM 14,518 75.7
major_discipline Humanities 670 3.5
major_discipline Other 382 2.0
major_discipline Business Degree 328 1.7
major_discipline Arts 253 1.3
major_discipline No Major 223 1.2
major_discipline Missing 2,815 14.7
relevent_experience Has relevent experience 13,814 72.0
relevent_experience No relevent experience 5,375 28.0

Numerical Variable Diagnosis

Distributions

Table 8: General list of numerical diagnosis
variables min Q1 mean median Q3 max zero minus outlier
city_dev_index -0.5 0.74 0.83 0.9 0.92 0.95 13 6 36
training_hours 1.0 23.00 65.36 47.0 88.00 336.00 0 0 986
test 0.0 0.00 0.00 0.0 0.00 0.00 19,189 0 0

Zero Values

Table 9: List of numerical diagnosis (zero)
variables min median max zero zero (%)
test 0.0 0.0 0.00 19,189 100.0
city_dev_index -0.5 0.9 0.95 13 0.1

Minus Values

Table 10: List of numerical diagnosis (minus)
variables min median max minus minus (%)
city_dev_index -0.5 0.9 0.95 6 0

Outliers

List of Outliers

Table 11: Diagnosis of numerical variable outliers
variables min median max outlier outlier (%)
training_hours 1.0 47.0 336.00 986 5.1
city_dev_index -0.5 0.9 0.95 36 0.2
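
The report does not state which rule it uses to flag outliers. As an illustration only, the sketch below applies the common boxplot convention (Tukey fences at 1.5 times the interquartile range) to the quartiles in Table 8. The choice of rule, the function name, and the multiplier k are assumptions, and the quartiles are the rounded values printed in the table.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences; values outside them are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles copied from Table 8 (rounded as printed)
lo_th, hi_th = tukey_fences(23.0, 88.0)    # training_hours
lo_cdi, hi_cdi = tukey_fences(0.74, 0.92)  # city_dev_index
```

Under this assumed rule, the fences for training_hours are -74.5 and 185.5. Because the minimum is 1.0, every flagged value would lie above 185.5, which is compatible with the outlier mean of about 247.5. For city_dev_index the lower fence is about 0.47 and the upper fence exceeds the maximum of 0.95, so every flagged value would lie below 0.47, which is compatible with the outlier mean of about 0.13.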

Individual Outliers

variable: training_hours

Table 12: training_hours
Measures Values
Outliers count 986
Outliers ratio (%) 5.14%
Mean of outliers 247.5335
Mean with outliers 65.35984
Mean without outliers 55.49206


variable: city_dev_index

Table 12: city_dev_index
Measures Values
Outliers count 36
Outliers ratio (%) 0.19%
Mean of outliers 0.1282222
Mean with outliers 0.8278187
Mean without outliers 0.8291337
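
The three means in each table above are linked: the mean with outliers is the count-weighted average of the mean of the outliers and the mean without them. The sketch below checks this using only figures from the report. It assumes the two variables have no missing values (Table 3 marks none), so the total count is the 19,189 observations from Table 1. The function name is hypothetical.

```python
def pooled_mean(n_total, n_out, mean_out, mean_rest):
    """Mean of all values, rebuilt from the outlier and non-outlier means."""
    return (n_out * mean_out + (n_total - n_out) * mean_rest) / n_total

N_OBS = 19_189  # observations, from Table 1

# training_hours: 986 outliers
th_mean = pooled_mean(N_OBS, 986, 247.5335, 55.49206)

# city_dev_index: 36 outliers
cdi_mean = pooled_mean(N_OBS, 36, 0.1282222, 0.8291337)
```

Both results agree with the reported "Mean with outliers" values (65.35984 and 0.8278187) to within rounding. They also show how much the outliers matter: removing them lowers the training_hours mean by about 10 hours, while the city_dev_index mean barely changes.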